De novo prediction of RNA 3D structures with deep generative models

We present a Deep Learning approach to predict 3D folding structures of RNAs from their nucleic acid sequence. Our approach combines an autoregressive Deep Generative Model, Monte Carlo Tree Search, and a score model to find and rank the most likely folding structures for a given RNA sequence. We show that RNA de novo structure prediction by deep learning is possible at atom resolution, despite the low number of experimentally measured structures that can be used for training. We confirm the predictive power of our approach by achieving competitive results in a retrospective evaluation of the RNA-Puzzles prediction challenges, without using structural contact information from multiple sequence alignments or additional data from chemical probing experiments. Blind predictions for recent RNA-Puzzle challenges under the name “Dfold” further support the competitive performance of our approach.


Introduction
Ribonucleic acids (RNAs) are polymeric molecules that can act as information messengers, mediators, and regulators in the expression of genes.The specific function of RNA is tightly associated with the 3D folding structure, which in turn is determined by its sequence of nucleobases.The accurate prediction of RNA 3D structure from its primary sequence would advance the design of synthetic RNA for biotechnological or therapeutic purposes and help to improve RNA vaccines or RNA based gene therapies.Using Deep Generative Models for RNA structure prediction circumvents the complex tasks of formulating an energy function from which structural candidates can be generated but requires a sufficient amount of examples to learn the complex conformational states RNA molecules can take.The functional diversity of RNA in living cells is a consequence of its ability to form specific three-dimensional (3D) folding structures that allow for interaction with DNA, RNA, proteins, and small molecules [1,2].Understanding the relationships between sequence, structure, and function of RNA is essential for understanding the function of living cells and particularly useful for the design of RNA therapeutics [3,4].Furthermore, the automated prediction and the targeted design of RNA tertiary structure would be an important step to further improve RNA therapeutics and to advance the field of RNA biotechnology in general [5].
Algorithms for predicting RNA 3D structure from nucleotide sequence [6] are dominated by four approaches: (i) template based methods such as FARFAR2 [7,8] and 3dRNA2 [9,10], which decompose known structures into 1-to 3-mer fragments and combinatorially reassemble them to find the structures with lowest molecular interaction energies [11], (ii) coarse grained force field methods that minimise interaction energy by stochastically displacing groups of atoms like SimRNA and RNA-BRiQ [12,13], (iii) comparative modelling methods that are based on the availability of homologous structures, and (iv) machine learning approaches [14,15] that combine sequence and chemical probing information to generate candidate structures.Despite the steady increase in affordable computing power and the use of more accurate energy functions [11,16], the de novo structure prediction of larger RNAs (>80nt) still remains challenging [7].For secondary structure prediction deep learning based models like RNA-FM, U-Fold and SPOT-RNA have allready surpassed shallow networks and energy based methods [17][18][19].Hence, it is appealing to study deep learning in the tertiary prediction setup and recent studies with DRFold have shown that end-to-end deep learning for tertiary structure prediction can be achieved [20].Historically, the prediction challenges for RNA structure prediction algorithms started with benchmarks for the prediction of small scale structures [21] (2012) up to larger structures in [22] (2015), followed by more complex folds such as riboswitches and ribozymes [23] (2017).In this historic context fragment-assembly methods perform best, giving especially leading structure predictions for larger sequences.
For proteins, the benchmark for predicting 3D structures with atomic resolution is set by deep learning approaches that take sequence information as input and predict both the distances between C α or C β atoms, the dihedral angles of the polypeptide backbone, and the conformation of the side chains [24,25].Here, the accurate prediction of the global folding structure crucially depends on the existence of a sufficient amount of homologous sequences from multiple sequence alignments (MSA), which allows to identify at least some of the residues that are in contact [26].These global structural constraints can be inferred from correlations between amino acid substitution frequencies that arise from an evolutionary selection pressure for stably folded protein structures [27].The ability of deep neural networks to identify and generate complex statistical patterns in high dimensional spaces and to generalise well across training examples makes deep learning approaches conceptually attractive for predicting protein structures.Hence, within the scientific community concerned with RNA, the question on when deep learning will lead to breakthrough has allready been raised [28] However, deep learning approaches are in general data hungry and structure predictions strongly benefits from a sufficiently large number of homologous sequences for each high resolution structural example in the training set.For the same reason, the use of clever data augmentation strategies are crucial to achieve good performance [29].
Structure prediction for RNA shows some fundamental differences to proteins.First, in contrast to almost all proteins, RNAs often fold into different alternative structures under physiological conditions that are either stable or visited over time with high probability [30].Second, for RNA the conformation of the phosphate backbone is strongly constraint by the pairing of nucleobases, whereas for proteins the spatial location of the side chains is strongly constraint by the polypeptid backbone.This difference arises from the fact that the secondary structure of proteins is determined by hydrogen bonds within the peptide backbone, whereas the secondary structure of RNA is determined by hydrogen bonds between nucleobases.Third, as training of large deep learning models requires a large amount of independent training examples, the two orders of magnitude less available structures for RNAs in comparison to proteins implies stronger restrictions on the model complexity for RNA structure prediction.Finally, the less conserved RNA structures make it much harder to identify homologs for MSA and therefore the crucial information about global folding constraints is in many cases not accessible.On the contrary, there exist structural probing methods to estimate the probability of each nucleotide to be part a of a base paring interaction, such as SHAPE [31] or DMS [32].However, unlike MSAs, structural probing methods can only give an estimate if a nucleotide is in contact, but lack direct information about the contact partner.Moreover, structural probing data represents an ensemble average over the structural conformations that a given RNA can take and therefore can provide only useful information if the secondary structure is sufficiently stable.

Materials and methods
To encode 3D RNA structures, we used a rotational invariant representation that was given by the Euclidean distances between nucleotides, with each nucleotide position uniquely determined by a set of 5 selected atoms, where different sets were taken for purines and pyrimidines (Fig 1).We made use of a Vector Quantised Variational Autoencoder [33] (VQ-VAE) to compress the 5 × 5 Euclidean distances between the selected atoms into K classes for each possible nucleotide pair.We refer to these classes as distance classes, as the K = 3 classes we used throughout this work agree well with the qualitative distance measures "near", "intermediate", and "far" (Supplementary Information).The encoded distance classes represent the targets  1.The position of each different nucleotide is determined by five out of eight selected atoms, resulting in an 8 × 8 matrix for each single nucleotide which can be flattened into a vector.The resulting L × L × 64 Euclidean distances for all nucleotide pairs are encoded into L × L × K discrete distance classes by a VQ-VAE [33], we choose K = 3.The generation process uses a Deep Neural Network (DNN) to predict probability values for the distances classes of shape L × L × K, so that for each pixel a via softmax a probabilitiy distribution of the K distance classes is learned.From these predictions a single distance class for a single nucleotide pair is selected according to the MCTS policy (Methods) to iteratively generate a path in the search tree.At each iteration the currently selected distance classes and the sequence information are presented as input to the DNN.Once all distance classes are selected, the Euclidean distances can be recovered by the VQ-VAE decoder.A Score Model (Methods) selects the most promising generated structures, which are then further refined by minimising a coarse grained molecular energy function [13].
https://doi.org/10.1371/journal.pone.0297105.g001used for training a Deep Generative Model that takes sequence information and masked targets as input.The task of the Generative Model was to predict the probabilities of the masked distance classes.For the masking, we first selected the fraction of nucleotide pairs (pixels) to be masked by randomly drawing an integer number n from the set {1, 2,.., (L × (L − 1)) 2 /2}, with L the sequence length, and then randomly selecting n out of (L × (L − 1)) 2 /2 pixels whose onehot encoded target values were then overwritten by assigning each distance class the same value.We only need to select up to (L × (L − 1)) 2 /2 pixels since distance matrices are symmetric.Training neural network architectures with masked targets on input shows surprisingly strong generalisation behaviour and has resulted in state-of-the-art results for learning words representation in Natural Language Processing (NLP) and for image generation in computer vision [34][35][36][37].After training, a structure can be iteratively built up by sampling a distance class for each nucleotide pair according to a MCTS search algorithm (Methods) and presenting the selected distance class at the input (Fig 1).Although our generative model allows to estimate the likelihood for each predicted structure by making use of the chain rule for probability mass functions [36], this value is in general unreliable [38].We therefore trained a Score Model (Methods) that allowed to score the match between sequence and generated structures, similar to a value function in reinforcement learning [39].Each predicted, one-hot encoded distance matrix with high score was mapped back to an Euclidian distance matrix, using the decoder of the VQ-VAE.The Euclidean distances were further fine tuned by minimising a coarse grained, physical RNA energy function [13].
In our approach we included some best practices for training deep neural networks.First, deep neural networks strongly benefit from end-to-end learning, where gradients for updating parameters are allowed to propagate from the objective function back to the input, thereby avoiding extensive preprocessing steps that might reduce the information content [40].Second, the inductive bias induced by the network architecture should match the structure of the data.We therefore combined self-attention layers (Supplementary Information) to extract long-range interactions within the RNA sequence and used convolutional layers to predict local correlations in the RNA 3D structure [41,42].Third, the final performance of a deep learning model depends significantly on (i) the neural network size, (ii) the amount of training data, and (iii) the training time.The generic empirical observation, which is also confirmed in this work, is that increasing (i)-(iii) increases the prediction accuracy [43].Consequently, we employed advanced data augmentation techniques, which allowed us to train larger networks that were able to model more complex mappings and achieve better generalisation.

Data extraction and preprocessing
For the construction of our training, validation and test set we extracted 2581 RNA molecule entries from the RNAsolo database [44] with a resolution of less than 4 Å.From that, we distilled out structures that are non-redundant RNA only folds and split large complex structures into their single chain components.We discarded RNA structures in complex with protein/ DNA and multichain RNAs and removed sequences with non A,G,C,U content.By including RNA structures of the RNA-Puzzles challenges from the PDB that were missing in RNAsolo, we arrived at 1454 RNA single chain structures.The extracted structures were grouped according to their Bowling Green State University (BGSU) RNA class membership [45], on which we performed hierarchical clustering based on sequence similarity, as high sequence similarity typically implies high structural similarity.Clusters were build based on a sequence similarity cutoff of 0.7.The test set was build from clusters that comprise the RNA-Puzzles.The complete dataset was split on cluster level into training, validation, and test set, with cardinality 1127, 327, and 78, respectively.We perform such a stringent splitting to detect overfitting, as larger deep learning models can memorise structures based on sequence input.To augment the structural data, we carried out Molecular Dynamics (MD) simulations [13] for each of the 1127 sequences that were initialised by the atom positions of structural variants that correspond to the same PDB id (NMR ensembles or symmetrical copies of biological assemblies).We ran the simulations independently 5 times with a varying number of time steps to obtain "drifted structures" with about 3 Å root-mean-square error (RMSE) to the original PDB structure.The drifted structures induce noise on training targets, similar to label smoothing [46,47], which is a regularisation technique that has been introduced to avoid overconfident predictions.From the 1454 structures, 467 had a sequence length L � 100nt.We used the sequences of length L � 100nt to generated additional structure-sequence pairs by randomly cropping them to length L = 100nt.For each cropped structure, we memorised the contact nucleotides, which are those nucleotides of the crop that have less than 3.3 Å distance to the remaining nucleotides of the original structure.We included only crops in the training dataset with less than 5% contact nucleotides.This very stringent cutoff reduces the bias of training examples towards contact constrained folding structures.We used a binary indicator variables to mark all possible pairs of contact nucleotides and showed the corresponding distance classes for these pairs as fixed input during training.As cropped structures made up most of the training set, the model effectively learned to predict substructures that were constrained by the remaining part of the RNA structure.To predict free folding RNA structures we take the trained prediction model for the cropped structures and set the binary indicator variables to zero.This data augmentation approach is similar to the concept of non-leaking data set augmentations [48].As our dataset contains large structures with length up to 1513 nucleotides, random cropping results in strong data augmentation with a total of 6245 unique structures.After generating drifted structures for each unique structure, the augmented training, validation, and test sets comprise 27644, 3270, and 78 structures, respectively.We determined the position of each nucleotide by 5 selected atoms (Fig 1 ) and compressed the 25 possible real distances between the selected atoms for any nucleotide pair into K = 3 distance classes using a Vector Quantised Variational Autoencoder (VQ-VAE).

Autoregressive generative model
The generation of a 3D structure, s, from sequence information, x, was carried out iteratively by first selecting a nucleotide pair (pixel) with index i 2 {1,.., N} from the N = L(L − 1)/2 possible pairings and subsequently selecting a distance class k i 2 {1,.., K} according to the class probabilities predicted by the Generative Model, P(k|s t , x).The selected distance class was then one-hot encoded, resulting in an updated input structure s t+1 s t .We denote by a t ¼ k i t the "action" for the t-th iterative step in the generation process and defined the current structural state by s t = (a t , . .., a 1 ).Actions are never overwritten during the generation process, which starts from an empty set of actions s 0 by "masking" all pixels as defined below.The Generative Model was realised by a feed-forward neural network P(k|s t , x) = ∏ i P i (k i |s t , x) that predicted probabilities for all distance classes k = (k 1 , k 2 , . .., k N ) in parallel, given the previous actions s t .The conditional independence of the predicted class probabilities is a consequence of the deterministic network architecture, where the outputs (class probabilities for each pixel) are uniquely determined by the input.The complete input feature map of the autoregressive model comprised of (i) a one-hot encoding of the 16 possible nucleotide pairs (AA,AC, . .., UU) for each pixel, (ii) the already set distance classes in the autoregressive process, s t , (iii) coordinate frames that included the diagonal as symmetry axis and padded regions for sequences of length L < 100, (iv) the output of a Self-Attention layer [42] that takes structural probing data and homologous sequences as input (Fig 1 ).Training was carried out by showing for s t a random fraction of distance classes (masked target) at the input, where the one-hot encoding for the K distances classes was substituted by information that was related to the target logits, e.g.(1, −1, −1) if the target was the first distance class and (0, 0, 0) if the target class was masked.The number of masked distance classes shown at the input was distributed according to a truncated half-normal distribution, to enforce that almost complete targets are shown less frequently during training.The feed-forward network architecture was a residual network with 16 residual blocks and 26 channels in each hidden layer, trained by early stopping (Supplementary Information).The generative model allowed to compute likelihood estimates of a structure s N for a given nucleotide sequence x by making use of the chain rule of probability, Pðs N jxÞ ¼ Q N t¼1 Pða t js tÀ 1 ; xÞ.

Score model
The likelihood estimate was improved by learning a Score Model Dðs N ; s 0 N ; xÞ (discriminator) that was trained to distinguish between correct and incorrect distance-map/sequence pairs by maximising the objective and f ðs 0 N ; xÞ being the scalar output of a deep neural network.The theoretically optimal solution f ðs N ; xÞ ¼ log P true ðs N ;xÞ P false ðs N ;xÞ is typically not reached by optimisers based on stochastic gradient decent [49].Here, "true" corresponds to original PDB examples and "false" to PDB examples with drifted atom position using Molecular Dynamics Simulations [13] under high temperature and encoded by the VQ-VAE or distance-maps predicted from the Generative Model.The discriminator compares two complete distance-maps with respect to their match to a given sequence, which is in contrast to the absolute likelihood estimate from the chain rule of probability.The value f ðs 0 N ; xÞ is used to rank the predictions that are sampled from the Generative Model.For the Score Model, we used a Residual Network architecture with 8 residual blocks, where blocks were connected by down-sampling layers using stride 2 convolutions (Supplementary Information).

Structural sampling
Unlike proteins, RNAs frequently fold into different structures under physiological conditions.To identify the structures that occur with high probability, we had to sample the large combinatorial space of possible distance-maps and rank them according to their corresponding likelihood.For RNAs of length L = 100 nucleotides, the combinatorial space of allowed distances is given by K L(L − 1)/2 > 10 2000 for K � 3 and thus exceeds the number of possible games, *10 700 , that can played in the board game Go.To realise fast sampling, we borrowed search strategies from reinforcement learning (RL).We aimed to find the best sequential ordering (a t , . .., a 1 ) for presenting distance classes at the input, such that after a minimum number of autoregressive steps the probability masses for the remaining nucleotide distances became highly concentrated into one class.This ordering allowed to predict the final distance map after T � N steps by selecting the most likely distance classes for the set of remaining masked pixels, M T , in parallel

Pðk j js T ; xÞ
To find a close to optimal autoregressive ordering, we developed a variant of the Monte Carlo Tree Search (MCTS) that accounts for the fact that some actions affect the global distance-map and hence structure more that others.For a given nucleotide sequence, a tree of possible RNA distance-maps can be built iteratively by connecting incomplete distance-maps (nodes), s t , with actions, a t , such that each leaf node s L of the current tree can be reached by a unique path of actions s L = (a L , . .., a 1 ).For selecting the actions to reach a leaf node from the root node (empty set of actions), s 0 , we followed the selection rule (policy) [ For leaf nodes that were terminal nodes, s L 2 S T , we updated N(s t , a t ) N(s t , a t ) + N expl , with N expl = 10 a hyperparameter, along the path and leave Q(s t , a t ) unchanged to enforce exploration of different distance-maps and thus structures.

Structural ensemble
We generated complete distance-maps for the set of terminal leaf nodes S T by argmax sampling from P(k|s T , x), as introduced above.We ranked the complete distance-map relative to each other according to the discriminator output.The resulting ensemble was a subset of the possible distance-maps that an RNA can take.As we started out to the find the most likely distance-map by MCTS, with alternative distance-maps as a by-product of the search, the resulting ensemble was highly biased towards distance-maps with high likelihood.

Refinement
For the RNA-Puzzles challenges (Fig 2 ) we carried out refinement steps using a coarse grained force field method.For the sampled structural ensemble we computed 1000 simulation runs using SimRNA.These simulation runs typically cluster near local optima.We used the built-in SimRNA clustering function based on all Molecular Dynamics traces with an energy cut off at 25% and computed five spectral clusters based on 5.0, 10.0, 15.0, 20.0 and 25.0 Å pairwise distance between clusters.The cluster centers represent candidate structures from which the one with highest likelihood was chosen for the retrospective evaluation and all 5 candidates were submitted to the blind RNA-Puzzless prediction challenge (Supplementary Material).

Results and discussion
We first tested the reconstruction accuracy of the VQ-VAE as a function of the number of distance classes (Fig 2a).For K = 8 classes, the reconstruction error approached the average experimental resolution of 2.8 Å root-mean-square error (RMSE).For K = 3 classes the median reconstruction error was still in the range of the best predictions with 4 Å RMSE.Next, we https://doi.org/10.1371/journal.pone.0297105.g002simulated a blind test by evaluating the accuracy on a held out test set that is given by the crystal structures of the RNA-Puzzles challenges [51] (Fig 2b and 2c), including only free folding RNA structures.
We observed significant improvements for RNA structure prediction problems that were classified as difficult [7] (RNA-Puzzles 7, 27, and 28) due to their longer sequence (>100nt) and their lack of homology to known structures (e.g.6ufm in Fig 2d, S3 Fig).The difficulty can be seen that all competing puzzle submissions have a high RMSD on the choosen examples, which is also true for our approach.We have also participated in three blind predictions (7mlx, 7eoj, 7elq) under the submission name "Dfold" with single digit RMSD reconstruction error, which is in line with our retrospective evaluation.The blind predictions confirm that our deep learning based approach can indeed predict new structures and is not only memorizing the training data.
To investigate the effect of structural probing data (SHAPE), we used a force field model [13] to simulate SHAPE reactivities which are shown at the input during training.We thereby assumed that single stranded RNA is more flexible than double stranded RNA and thus shows higher mean squared displacement (MSD) of atoms during the force field simulations.The simulated MSD values show good agreement with the experimentally determined SHAPE reactivities (Fig 2e).We observed a small improvement in prediction accuracy on average when we presented both SHAPE data and MSAs of homologous sequences at the input, which indicates that the additional constraints imposed by simulated SHAPE data and evolutionarily constrained nucleotide-nucleotide interactions provide only little additional information to our model for RNAs of length L � 100nt.However, we found that this additional input data allowed to infer global structural information (S2 Fig), thereby accelerating MCTS.This acceleration might become crucial in cases where exploring the global structural space by MCTS is the limiting factor.
The ability of our approach to find alternative structures can be used to predict the different states of riboswitches (Fig 2f).We simulated a blind prediction test by removing all homologous structures from the training set for the Glutamine Riboswitch (PDB: 5DDO) and the ZMP Ribosowitch (PDB: 4XW7).The predicted alternative structures, which represent two highest ranked branches of the MCTS by the Score Model, confirm the general viewpoint that riboswitches work by a ligand mediated stabilisation of one structural conformation.
Despite the strong advances in protein structure prediction using Deep Learning approaches, RNA 3D structure prediction remains challenging for longer sequences.The reason can be attributed to the limited number of experimentally determined structures in public databases (* 10 2 less structures for RNAs as for proteins) and the fact that some RNAs can fold in different structural variants under physiological conditions.The existence of an ensemble of possible RNA structures cannot be appropriately addressed by deterministic feed-forward network architectures within deep learning approaches that assign each sequence exactly one structure [24].We therefore employed a structural sampling approach that combines a deep generative model with an efficient search method through structural space.Our approach provides a more efficient way of generating structural candidates than fragment assembly approaches.This higher sampling efficiency may become crucial for longer RNA sequences, where sampling the combinatorial space using structural elements becomes computationally prohibitive.On the downside, predictions by deep neural networks frequently violate physical constraints and require additional relaxation steps to generate valid structures with atom resolution [24].For that, it is worth exploring additional, e.g.stereochemistry data to stack next to the SHAPE matrix for an uplift in prediction performance as validated in [53].We therefore expect that combining more RNA specific datasets and the use of an advanced score model with atom resolution, such as ARES [16], to relax the outcome of a deep generative model is a promising strategy for further improvements of RNA 3D structure prediction.We have also tested our model under various data splits and need to highlight that without a rigorous training, validation and test set split, deep learning based models are highly exposed to overfitting, especially given the small amount of experimentally determined RNA 3D structures available.

Data preparation and preprocessing
Source data.For the construction of our training, validation and test set we extracted 2581 RNA molecule entries from the RNAsolo database [44] with a resolution of less than 4 Å.From that, we distilled out structures that are RNA only folds and we splitted large complex structures into their single chain components.We further added additional RNAs that are not availaible in RNAsolo directly from the PDB (especially from the RNApuzzles) and cleaned the so derived raw dataset: In particular, we only kept a structure if (i) it had only RNA chains (entry-type in pdb_entry_type.txtwas "nuc"), (ii) the resolution method was either X-ray diffraction or NMR ("method" in pdb_entry_type.txtwas either "diffration" or "NMR"), (iii) the PDB structure was downloadable from http://files.rcsb.org/download/,(iv) the "resNames" were either A, C, G or U, (v) the chain had at least 14 nucleotides, or at least 7 nucleotides if the structure was composed of multiple chains, (vi) for structures with 3 or more chains, all chains were given by unique sequences.
Thus, we derived with 1454 RNA single chain structures.The so extracted structures were grouped according to their BGSU class members [45] to not have same BGSU class members from the training set in the validation and test set.To further enforce a structural difference between the training and validation and test set, we enforce a splitting with an analysis of sequence similarity, using hierarchical clustering with a similarity cutoff of 0.7.We split the clusters into training, validation, and test sets (1127, 327, and 78 respectively).All RNA-Puzzles are kept in the test set to have a represantative blind prediction simulation.The validation set is derived from all class members and clusters of similar sequence from the puzzle test set so that no class members from BGSU as well as class members of similar sequence are in the training set.The training set only contains pdb entries, that have no class and similarity member in the validation and test set and hence, can be considered as structural different, so that the networks can not memorize the structures.If we do not perform such a stringent splitting we experience much stronger benchmark results, pointing to the fact that deep learning models can just memorize structures based on sequence input.
Neural network input data format.All PDB structures were converted into a reduced 5-atom positional representation for each residue [13] (Fig 1).For Guanosine monophosphate and Adenosine monophosphates, we used the P, C4', C2, C6 and N9 atoms.For Cytosine monophosphates and Uridine monophosphates, we used the P, C4', C2, C4 and N1 atoms.To distinguish between purine and pyrimidine residues (G/A and C/U, respectively), we encoded the cartesian coordinates (x, y and z) of the 5 atoms by an 8 × 3 coordinate matrix and indicated the valid atoms (rows) by an 8 × 1 mask array, see Table 1.
When atoms were missing in a structure-which occurs frequently for the leading phosphate group-the corresponding coordinate and mask values were set to zero.The two atoms that determined the position of the phosphate backbone (first two rows in coordinate matrix) are shared between purines and pyrimidines.While both Purines and Pyrimidines have C2-atoms, this atom does not occur at the same position in the nucleobases' ring structures and hence was encoded separately (see red frame in Fig 1).
Distance and mask tensors.For an RNA sequence of length L, we computed for all of the L × L possible residue pairs the Euclidean distances between the encoded atoms.The resulting 8L × 8L distance matrix was re-shaped into a L × L × 64 distance Tensor D and symmetrized to satisfy the symmetry condition D ijk = D jik .For example, D 1L1 = D L11 is the Euclidian distance between the phosphate atoms of the first and the last residue.
Some atom-atom distances in this distance tensor D, however, did not correspond to any meaningful values because the corresponding atoms were missing and thus the coordinate entries in the 8 × 3 matrices were zero.The respective elements in D were set to zero by multiplying distance tensor D with a mask tensor M that was calculated as follows.First, the mask arrays of size 8 × 1 (Table 1) for the L residues were stacked to a single array of size 8L × 1 and the outer product with itself was calculated to obtain a matrix of size 8L × 8L.This matrix was then reshaped into a tensor of size L × L × 64 and symmetrized to obtain the mask tensor M.
Structures with long or multiple chains.When a structure had multiple chains or a chain's length exceeded 100 nt, we carried out the following steps to obtain multiple, smaller substructures suitable for model training.First, when a structure had multiple chains, a chain was randomly selected with probability proportional to its length and used as a substructure.Then, if that chain's length exceeded 100 residues, a random, continuous subsection of that chain was cut out and used as the substructure instead.For each residue in this substructure, the distances to the residues of the remaining, overall structure were calculated.For distances below a threshold of 3.3 Å, the corresponding residues of the substructure were flagged as "fixed" and the corresponding distance classes between "fixed" residues were presented at the input during training of the generator.
Data augmentation.SimRNA is a 3D RNA structure prediction software that makes use of coarse-grained residue representations and Molecular Dynamics Methods to sample the conformational space [13].This program starts with a circular initial RNA structure (that resembles a snake biting it's own tail) and then folds the RNA to minimize an energy function while slowly cooling down the thermodynamic system.To do data augmentation, we used SimRNA the opposite way: we started with the original PDB structure and then increased the temperature to "drift away" from the original structure.Using this method, we generated 100 of such "drift structures" for each training example that were approximately 1, 3, 5 and 10 Å RMSE away from the original PDB structure.
Simulated SHAPE data.SHAPE (selective hydroxyl acylation analyzed by primer extension) [52] is a method for obtaining RNA secondary structure information.SHAPE exploits the fact that RNA residues that do not engage in base pairing, such as residues in dangling ends or loops, react more easily with certain reagents, making their detection possible by primer extension.In other words, higher SHAPE reactivities indicate regions of higher RNA flexibility.SHAPE data generated under comparable experimental conditions is only available https://doi.org/10.1371/journal.pone.0297105.t001 for a very limited number of PDB structures [52].We hence used above mentioned drift data (see "Data augmentation") to simulate SHAPE data.Specifically, we structurally aligned original PDB structures and their drift structures and used the average absolute position-specific deviation (that is, the RMSE between individual atoms) as simulated SHAPE data.Experimental SHAPE data.We used publicly available SHAPE data [52].As the simulated SHAPE data only correlates with but does not exactly match the experimental SHAPE data, we rescaled the experimental SHAPE data.For this purpose, we mapped all data percentiles between experimental and simulated SHAPE data (for example, a SHAPE value that fell into percentile 10 among the original SHAPE data was mapped to the 10 th percentile of the simulated SHAPE data).This mapping was learned using the following PDB IDs: 2L1V, 2K95, 3PDR, 3DIG, 1P5O, 3G78 and 2N1Q.
Similarity clustering.Structure is generally more conserved than sequence, hence structural similarity should ideally be used to split the data set into training, test and validation data sets that share as little homology as possible.However, similarity scores based on RMSE have the problem that global rearrangements dominate this score even if all structural domains are conserved perfectly.We therefore decided to use sequence similarity as a proxy for structural similarity and hence functional similarity.We aligned all sequences of all chains in the data set with a scoring function that rewarded matches with 1 and punished mismatches and opening gaps with -1.Gap extensions were treated as neutral.Then, we summed up the scores for structures that had multiple chains, and divided the scores by the overall length of the longest of each compared sequence.Using these length-specific matching scores, we used complete hierarchical clustering and chose a reasonable cutoff (0.7) to obtain sequence clusters.
Data sampling.First, we sampled uniformly over all sequence clusters.Longer sequences are more informative than shorter sequences, hence we sampled proportional to sequence length.
Homologs.To obtain sequences homologous to those in our PDB structure data set, we followed the following workflow.First, we downloaded homologous sequences from RNAcentral [54] using a Python script published on the RNAcentral Sequence search API website at https://rnacentral.org/sequence-search/api.Then, we only kept homologous sequences that fulfilled the following conditions (i) their E-value had to be less than 0.01, (ii) they must not have insertions, and (iii) they had to have at least one mutation other than a deletion.Gaps in homologous sequences were filled up with the letter 'N'.

Deep learning architectures and hyperparameters
VQ-VAE network architecture.We encoded RNA tertiary structures by distance tensors (L × L × 64)-as explained above-that act as both input and target of our VQ-VAE.We zeropadded all entries of distance tensors that were outside the maximum sequence length L = 100.Following the original work of the VQ-VAE [33], the encoder architecture was a residual network [55] that consisted of one convolutional layer followed by 4 residual blocks.Each residual block was of the form: Block = 2x[Batchnorm, ReLU Activation, Conv].The two convolutional layers ("Conv") within each block could have different convolutional kernels, whose sizes we report in the following format hereafter: [hight × width, out_channels], with block and kernel settings as in Tables 2 and 3 In the vector quantisation step, the encoder's residual network output of shape (L × L × 8) was mapped to VQ-VAE encodings (distance classes) of shape (L × L × 3) using 3 codebook vectors of embedding dimension 8.We enforced symmetry of the distance class tensors (along the first 2 dimensions) by adding the transpose of the embedding and dividing the result by 2.
The decoder architecture took the VQ-VAE encodings (distance class tensor) as input and consisted of four sets of residual blocks followed by a final ReLU layer.
Between each set of residual blocks, we up-scaled the number of feature-maps by inserting an additional convolutional layer that doubled the number of feature maps.The output of the decoder had shape (L × L × 64) and corresponded to the distance tensor reconstruction.For training, we used the same learning objective as in the original VQ-VAE [33] using an exponential moving average for the codebook vector updates with a decay rate of 0.99.For optimization, we used standard Adam Optimizer [56] with a learning rate of 1 × 10 −5 at a batch size of 100.
Generator network: Data preprocessing.We stacked the following 4 tensors to obtain input tensors for the generator network: (i) an encoded RNA sequence tensor, (ii) the corresponding, partially masked distance class tensor, (iii) coordinate frame tensors to provide the network with positional information, and (iv) an optional attention map of homologous sequence alignment and SHAPE data.RNA sequences of length L were encoded as unique bit patterns of shape (L × L × 8), (see S1 Fig) .First, an RNA sequence was one-hot encoded by a tensor of shape (L × 4).Unknown nucleotides that were denoted by an "N" in the sequence were encoded by setting all values in the one-hot encoding to 0.25.This one-hot encoded sequence was then copied L − 1 times to obtain a tensor of shape (L × L × 4).Then, this tensor and its transpose were stacked along the last dimension to obtain a tensor of shape (L × L × 8).This tensor corresponded to a unique bit pattern for each possible pairing and also contained directional information.For sequences with L < 100, the sequence tensors were uniformly padded with −1.
Partially masked distance class tensors were obtained from the distance classes using a pretrained VQ-VAE, followed by a "partial masking" process in which some pixels were set to zero as described in the following.First, we drew the number of pixels to be masked at a given training step from a truncated normal distribution with mean L 2 /2, standard deviation L 2 /4 and which was bounded at 2 standard deviations around the mean (this ensured that at least 0 and at most L 2 pixels were masked, while rarely masking either very few or very many pixels).Then, we randomly drew the corresponding number of pixel positions and encoded masked pixels with (0, 0, 0).However, we found that encoding masked pixels this way made it difficult for the network to learn the difference between masked distance classes (0, 0, 0) and the regular, non-masked distance classes (1, 0, 0), (0, 1, 0) or (0, 0, 1).To overcome this limitation, we encoded all regular, non-masked distance classes by setting all zeros to −1, so that, for example, (1, 0, 0) was encoded as (1, −1, −1).Coordinate frames consisted of a "diagonal frame" that had ones along the diagonal and was zero elsewhere, and a "padding frame" which contained a box of side length L that had ones along its border and was zero elsewhere.The coordinate frames were padded with −1 when sequences were shorter than 100 nt.The production of the attention map of homologous sequences and SHAPE data is described in the following section in detail.Overall, the generator input was a tensor of shape (L × L × 14), and contained the encoded sequence tensor of shape (L × L × 8), the partially masked distance tensor of shape (L × L × 3), both coordinate frames of shape (L × L × 2) and the homologous sequences alignment and SHAPE data attention map tensor of shape (L × L × 1).
Generator network: Attention map of homologous sequences and SHAPE.For each training example, an array of 50 randomly chosen, aligned and one-hot encoded homologous sequences was produced.While the 4 standard nucleotides were encoded using one-hot coding (e.g."A" was encoded as (1, 0, 0, 0)), gaps and unknown nucleotides were encoded by setting all possible one-hot coding values to 0.25 such that the one-hot coding vector became (0.25, 0.25, 0.25, 0.25).When there were fewer than 50 homologous sequences, the original sequence was used to "fill up" the array.The resulting array of one-hot encoded homolgeous sequences had shape (L × 50 × 4).Using a standard dense layer, this array was mapped onto a tensor of shape (L × 50).Then, simulated SHAPE data was included by stacking a vector containing L SHAPE reactivity values on that array, resulting in a tensor of shape (L × 51).Using two separate dense layers, two tensors of shape (L × 64) were produced and self-attention was applied to these tensors by using them as query and key tensors as described in the original Transformer paper [42], resulting in an attention map of shape (L × L × 1).Self-Attention added the benefit, that for every pixel, the corresponding nucleotide could receive both homologous sequence information as well as SHAPE reactivity information from all other nucleotides in that sequence.We computed results on a sample attention map for the Adenine Riboswitch (PDB: 1y26) in S2 Fig.
Generator network: Architecture.Using the (L × L × 14) input tensor described above, the generator network performed regression on the full distance class map under the masked learning objective, which is specified further below.The input tensor was first passed through a convolutional layer to be further processed by a deep residual network architecture with eight blocks, each one having the following structure: ResBlock = 2x[Batchnorm, Elu Activation, Conv] + 1x[Batchnorm, Elu Activation, Conv] + 1x[Batchnorm, Elu Activation, Dilated-Conv], with standard skip connections after the first two and between the last two subblocks.We employed standard convolutions with kernel sizes as described in Table 4 below.We also used dilated convolutions with a dilation rate of 2 in the residual blocks.Here, the last convolution layer compressed the residual network output into an appropriate shape of (L × L × 3) so that, after a final softmax activation, the network's output corresponded to predicted distance classes, see Table 4.
The network was trained under the masked learning objective, implemented using the cross entropy between target distance classes and generator predictions coming from input tensors with partially masked distance tensors as a loss function.We further employed weight decay (L 2 regularization) following standard recommendations for training large residual architectures [55].Table 5 shows the generator hyperparameter setup using a lazy Adam Otimizer.
For stabilized training, we chose a learning rate with linear warmup scaling and cosine decay.After five million iteration steps we stopped the network training.
Score model.The Score Model was implemented to distinguish which one of two distance class maps was better.Its architecture as described in Table 6 was a residual network using optimization hyperparameters as further described in Table 7.
The Score Model was trained to discriminate between "correct" and "incorrect" distance class maps as described in the following  distance class maps for all original as well as drift structures.We further increased the dataset of incorrect examples by factor 10 by randomly flipping class pixels from the targets distance classes.We also used the Generator networks' argmax predictions, showing only 0%, 5%, 10%, 15% and 20% of the target distance classes at the input.Then, at each training step, both a correct and a corresponding incorrect distance class map of shape (L × L × 3) were fed through the network to obtain two "logit maps" of shape (L × L × 1).The values of the correct logit map were then subtracted from corresponding values of the incorrect logit map, followed by taking the sum over all differences.We optimized this objective with L 2 regularization under the following hyperparameter setup and used a standard learning rate decay exponential to the iteration steps.MCTS: Sampling structural ensembles.The MCTS algorithm sampled pixelwise one of the three distance classes using the Generator network iteratively.Naturally, the diagonal had high class probabilitiy values, so that we started initialising the search tree by setting the diagonal to the nearest distance class.Except for the diagonal, we started by masking all (L × L × 3) target distance classes with zeros and iteratively filling in class indicators, e.g.(1, −1, −1) if the MCTS chose the first distance class.With that basic step logic, MCTS could iteratively fill up pixels and update visit counts and values at each node.The value objective in the MCTS was designed such that the network aimed to get a sharp view after some pixels were set, so that not every single of the L 2 /2 pixels needed to be sampled.This reduced the depth of the search tree by a large margin.Typically, the Generator network produced sufficiently sharp predictions when the MCTS was able to predict 30% of the target distance classes correctly.The remaining pixels were then filled up using an argmax prediction of the Generator network given that leaf.Exemplarly, we computed the leafs of the search tree for two alternative structures shown in S3 Fig for the ZMP-Riboswitch (PDB: 4xw7).
Refinement.The SimRNA simulation runs are carried out under the standard configuration file:

Fig 1 .
Fig 1.Data flowchart for the RNA structure generation process.The PDB structure is represented by Euclidean distances between nucleotide pairs and there are eight atom types for the RNA nucleotides, see also Table1.The position of each different nucleotide is determined by five out of eight selected atoms, resulting in an 8 × 8 matrix for each single nucleotide which can be flattened into a vector.The resulting L × L × 64 Euclidean distances for all nucleotide pairs are encoded into L × L × K discrete distance classes by a VQ-VAE[33], we choose K = 3.The generation process uses a Deep Neural Network (DNN) to predict probability values for the distances classes of shape L × L × K, so that for each pixel a via softmax a probabilitiy distribution of the K distance classes is learned.From these predictions a single distance class for a single nucleotide pair is selected according to the MCTS policy (Methods) to iteratively generate a path in the search tree.At each iteration the currently selected distance classes and the sequence information are presented as input to the DNN.Once all distance classes are selected, the Euclidean distances can be recovered by the VQ-VAE decoder.A Score Model (Methods) selects the most promising generated structures, which are then further refined by minimising a coarse grained molecular energy function[13].

Fig 2 .
Fig 2. Results: Evaluation of the structural predictions.A, Reconstruction error resulting from encoding and decoding RNA 3D structures of the test set as a function of the number of distance classes, B,C Simulated blind tests of the RNA-Puzzles Challenges in comparison to other approaches, with or without simulated SHAPE reactivity data and homologous sequences as additional input.The three right most predictions are real blind submission from the latest puzzle round 33 (7mlx, 7eoj, 7elq).Puzzles 6ufm, 6pom, and 6pmo are large compounds of tRNA-Riboswitch complexes.Puzzle predictions are sorted in descending order of the sequence length (left to right) D, Reconstruction error of longer RNA-Puzzles that lack both structural and sequence homology (PDB 4R4V: 185 nt,) in comparison to (PDB 4QLM: 108 nt, PDB 6pom tBox: 75, PDB 6ufm Complex: 175 nt) for which structural and sequence homologs are available.Structures of length >175 nt were cut into substructure of length �100 nt, aligned with PyMOL, and RMSD calculated as average over all substructures.E, Simulated chemical probing data in comparison with experimentally measured reactivities [52] (SHAPE) for PDB 1Y26, F, Most likely alternative structures, as predicted by the Score Model, for the Gluatamine riboswitch and the ZMP riboswitch (S3 Fig).
. The set of correct examples consisted of all 8048 examples in the training data set.To create the set of incorrect examples, we used SimRNAbased data augmentation to compute 100 drift structures of up to 10 Å RMSE away from each original structure in the training data set.Using the VQ-VAE setup, we then computed The computation of the five spectral clusters required the following steps: Example: 7elq: GGAGUAGAAGCGUUCAGCGGCCGAAAGGCCGCCCGGAAAUUGCUCC INPUT: Sequence 1. MCTS: sample structural ensemble using generative network Output: N structures python mcts.pyGGAGUAGAAGCGUUCAGCGGCCGAAAGGCCGCCCGGAAAUUGCUCC 2. SimRNA: for i in 1. ..N: ./SimRNA7elq_i.pdb-c config.dat# compute 1000 repition runs # -> N*1000 trafl files / pdb outputs # -> concatenate all trafl files Output: 7elq.trafl 3. SimRNA Clustering ./clustering7elq.trafl0.25 5.0 10.0 15.0 20.0 pdb files (cluster centers) particular example, the strechted grey RNA structure C was ranked with a lower score by the Score Model.(PDF) Uðs t ; aÞ ð Þ ; Uðs t ; aÞ ¼ c p ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi P the tree is enlarged by |S H | nodes in parallel, initialising N(s L , a) = 1 and Q(s L , a) = v(s L+1 |x) for all a 2 S H , with value function the entropy reduction rate L} the visit count N(s t , a t ) N(s t , a t ) + |S| and subsequently the expected reward Qðs t ; a t jxÞ Qðs t ; a t jxÞ þ jS H j Nðs t ;a t Þ ðvðs * Lþ1 jxÞ À Qðs t ; a t jxÞÞ, using the 'winner takes it all' value function vðs * L jxÞ, with s * L ¼ argmax a L 2S H vðs L jxÞ.
a Qðs t ; aÞ þ a Nðs t ; aÞ p 1 þ Nðs t ; aÞ with Q(s t , a) the expected entropy reduction of P(a|s t , x) if action a is taken, N(s t , a) a counter how often actions that connect the root node with a leaf node pass through the state-action pair (s t , a) under the actual policy, and U(s t , a) a term that upweights rarely visited actions (exploration), with c p being a tuneable constant.After reaching a leaf node, s L , the tree is expanded by randomly selecting a subset S R of the remaining masked pixels, and from S R a subset S H � S R of actions that result in sufficiently strong entropy reduction ΔH L < λlogK for the predicted class probabilities, with λ = 1 andDH L ¼Hðs Lþ1 ; xÞ À Hðs L ; xÞ Hðs; xÞ ¼ À j js; xÞlogPðk j js; xÞ For |S H | = ; the leaf node s L is added to set of terminal nodes S T S T [s L and for |S H |6 ¼; t¼1 Hðs 0 jxÞ À Hðs L jxÞ Hðs 0 jxÞAlong the path of actions from the root node to s L = (a 1 ,.., a L ), we updated for all t 2 {1,..,